feat(tools): truncate large tool results with workspace spill #489
TatsuKo-Tsukimi wants to merge 3 commits into dataelement:main
Conversation
Replace naive `conversation[-ctx_size:]` slicing with a walker that treats assistant.tool_calls and its matching role="tool" messages as one atomic block. Naive slicing can leave an orphan role="tool" at the head when the cut lands mid-pair — OpenAI rejects this with "No tool call found for function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks the tail backward, identifies blocks, and includes blocks whole or not at all. Orphan tools are silently dropped regardless of budget.

Replaces the head-only pop guard in:
- app/api/websocket.py (web chat)
- app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls, multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
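The atomic-block walk described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual helper: the function name matches the commit message, but the message shapes and control flow are assumptions.

```python
def truncate_by_message_count(messages, max_messages):
    """Walk the tail backward, treating an assistant message with
    tool_calls plus its contiguous role="tool" replies as one atomic
    block: include the block whole or not at all."""
    kept = []
    i = len(messages) - 1
    while i >= 0:
        msg = messages[i]
        if msg.get("role") == "tool":
            # Find the assistant message that issued these tool replies.
            j = i
            while j >= 0 and messages[j].get("role") == "tool":
                j -= 1
            if j < 0 or not messages[j].get("tool_calls"):
                # Orphan tool replies: drop silently regardless of budget.
                i = j
                continue
            block = messages[j:i + 1]
            if len(kept) + len(block) > max_messages:
                break  # block would not fit whole — stop, never split it
            kept = block + kept
            i = j - 1
        else:
            if len(kept) + 1 > max_messages:
                break
            kept = [msg] + kept
            i -= 1
    return kept
```

A cut that would land mid-pair therefore shrinks the window instead of emitting an orphan role="tool" head, which is the invariant the OpenAI API enforces.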
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b880d1ff67
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
        return tool_content
```

```python
    # Spill full content to disk
    full_path = agent_workspace / "_tool_results" / f"{call_id}.txt"
```
Use a unique spill filename per tool invocation
Writing spill files as _tool_results/{call_id}.txt can overwrite earlier truncated outputs because call_id is not globally unique across rounds for all providers (for example, the Gemini adapters synthesize IDs like call_1, call_2 each response in backend/app/services/llm/client.py). When a later tool call reuses the same ID, an older in-context marker now points to unrelated content, so a subsequent read_file retrieves the wrong payload and can corrupt the agent’s reasoning/output.
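One way to address this review point is to disambiguate the filename with a random suffix so reused provider IDs cannot collide. This is a hypothetical sketch of a fix, not code from the PR; `unique_spill_path` is an invented name.

```python
import uuid
from pathlib import Path

def unique_spill_path(spill_root: Path, call_id: str) -> Path:
    """Append a random suffix so provider-synthesized IDs like call_1,
    call_2 (reused each round by some adapters) cannot overwrite an
    earlier spill file. Hypothetical helper, not the PR's code."""
    return spill_root / f"{call_id}_{uuid.uuid4().hex[:8]}.txt"
```

The in-context marker would then carry the suffixed filename, so a later `read_file` always retrieves the payload for that specific invocation.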
Force-pushed: b880d1f to 1cff15b
💡 Codex Review
Reviewed commit: 1cff15b156
```python
    try:
        spill_root.mkdir(parents=True, exist_ok=True)
        full_path.write_text(tool_content, encoding="utf-8")
```
Bound growth of spilled tool-result files
Each oversized tool call writes a full payload file under _tool_results/ and there is no retention or cleanup path anywhere in the new truncation flow, so long-running agents that frequently call search/PDF tools will accumulate files without bound until AGENT_DATA_DIR fills up. Once disk pressure hits, subsequent spills fall back to partial excerpts and the model can no longer retrieve full results, so this becomes a production reliability issue rather than just storage overhead.
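A simple retention pass run before each spill would bound this growth. The sketch below is one possible shape, assuming mtime-ordered eviction is acceptable; `prune_spill_dir` and the cap value are invented for illustration.

```python
from pathlib import Path

def prune_spill_dir(spill_root: Path, max_files: int = 200) -> None:
    """Delete the oldest spill files beyond a fixed cap. A minimal
    retention sketch (hypothetical helper, not the PR's code)."""
    files = sorted(spill_root.glob("*.txt"), key=lambda p: p.stat().st_mtime)
    for stale in files[:max(0, len(files) - max_files)]:
        try:
            stale.unlink()
        except OSError:
            pass  # best-effort: never fail the tool call over cleanup
```

A time-based policy (delete files older than N days) or a per-agent byte budget would work equally well; the important property is that the cleanup is wired into the same code path that writes spills, so it cannot be forgotten.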
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.
Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
Changes:
- models/agent.py: + context_window_tokens (Integer, default=50000)
+ DEFAULT_CONTEXT_WINDOW_TOKENS constant
- schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
- alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
- services/history_window.py: + truncate_by_token_budget, refactored
common walker, JSON-serialized char->token estimate via existing
estimate_tokens_from_chars (chars/3 — overestimates safely)
- api/websocket.py: pass tok_budget to helper, raise DB load to
max(ctx_size, 500) so helper has room to choose
- api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
- frontend: AgentDetail Settings slider + i18n + types
10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.
Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
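The token-budget mode described above can be sketched as follows. For brevity this version omits the tool-pair atomicity walker (which the real helper shares with truncate_by_message_count); the function names and the exact chars/3 heuristic follow the commit message, but everything else is an assumption.

```python
import json

def estimate_tokens(msg: dict) -> int:
    # chars/3 over the JSON-serialized message deliberately overestimates
    # token cost, so the window errs on the small side.
    return len(json.dumps(msg, ensure_ascii=False)) // 3 + 1

def truncate_by_token_budget(messages, token_budget, max_messages):
    """Keep the newest messages whose summed token estimate fits
    token_budget, with max_messages as a safety cap against floods of
    tiny messages. Simplified sketch: ignores tool-pair atomicity."""
    kept, spent = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if spent + cost > token_budget or len(kept) >= max_messages:
            break
        kept.insert(0, msg)
        spent += cost
    return kept
```

This is why the commit raises the DB load to `max(ctx_size, 500)`: the helper needs more candidate messages than the final window so the token budget, not the DB query, decides what survives.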
Tool calls returning ~50 KB of search hits or PDF extracts used to enter
the in-context history verbatim, burning tokens for content the model
samples one slice of. New `services/tool_result_truncation.py` spills
oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt`
and replaces the in-context body with a head excerpt + a marker pointing
the model at `read_file` for retrieval.
Threshold: 4000 estimated tokens (~12 KB at chars/3) — soft-start to give
the model time to learn the read_file follow-up; can tighten to ~2000
once telemetry confirms reliable use of the marker.
Smart head:
- JSON-shape responses with a known array key (results/items/data/
entries/hits/documents) keep metadata + first 5 items + a synthetic
_truncated_<key>_count field, so search/list responses stay useful.
- Otherwise plain head-cut (preferring a line boundary when one is
near the cut point).
- Vision-injected list payloads (image_data markers) pass through.
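The JSON-shape branch of the smart head might look roughly like this. A sketch only: the array-key list and the synthetic `_truncated_<key>_count` field follow the commit message, while the function name and fallback behavior are assumptions.

```python
import json

ARRAY_KEYS = ("results", "items", "data", "entries", "hits", "documents")

def smart_head(tool_content: str, keep_items: int = 5) -> str:
    """If the payload is a JSON object with a known array key, keep the
    surrounding metadata, the first keep_items entries, and a synthetic
    _truncated_<key>_count field. Otherwise return the content unchanged
    so the caller can apply a plain head cut. Illustrative sketch."""
    try:
        payload = json.loads(tool_content)
    except (ValueError, TypeError):
        return tool_content  # not JSON: fall back to a plain head cut
    if not isinstance(payload, dict):
        return tool_content
    for key in ARRAY_KEYS:
        items = payload.get(key)
        if isinstance(items, list) and len(items) > keep_items:
            truncated = dict(payload)
            truncated[key] = items[:keep_items]
            truncated[f"_truncated_{key}_count"] = len(items) - keep_items
            return json.dumps(truncated, ensure_ascii=False)
    return tool_content
```

Keeping the metadata and a count of dropped entries lets the model see that a search returned, say, 50 hits of which 5 are shown, which is exactly the cue it needs to decide whether a `read_file` follow-up is worthwhile.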
Sandbox: spill files live under each agent's existing AGENT_DATA_DIR
sandbox; cross-agent leak is prevented by the existing read_file boundary.
Applied at three sites to prevent drift:
- llm/caller.py _process_tool_call (call_llm path)
- llm/caller.py call_agent_llm_with_tools inner loop
- services/heartbeat.py tool loop
agent_context.py rule 3 augmented with a paragraph teaching the model to
recognize the marker and use read_file for full content rather than
fabricating from prior knowledge.
16 new unit tests covering pass-through, spill, utf-8 round-trip,
threshold boundary, JSON-shape preservation, write-failure fallback, and
a realistic 50-result jina_search end-to-end. 41/41 total pass.
Note: 3 tool loops still exist (caller.py:432, caller.py:763,
heartbeat.py:335). Drift prevention is partial here — PR 3 consolidates
them into a single execute_tool_calls helper.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed: 1cff15b to b01c0f1
Summary
Tool calls returning ~50KB of search hits or PDF extracts used to enter the in-context history verbatim, burning tokens for content the model samples one slice of. New `services/tool_result_truncation.py` spills oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt` and replaces the in-context body with a head excerpt + a marker pointing the model at `read_file` for retrieval.

Threshold: 4000 estimated tokens (~12KB at chars/3) — soft-start to give the model time to learn the read_file follow-up; can tighten to ~2000 once telemetry confirms reliable use of the marker.

Smart head:
- JSON-shape responses with a known array key (`results`/`items`/`data`/`entries`/`hits`/`documents`) keep metadata + first 5 items + a synthetic `_truncated_<key>_count` field, so search/list responses stay useful.

Sandbox: spill files live under each agent's existing `AGENT_DATA_DIR` sandbox; cross-agent leak is prevented by the existing `read_file` boundary.

Changes

Applied at three sites to prevent drift:
- `llm/caller.py`: `_process_tool_call` (call_llm path)
- `llm/caller.py`: `call_agent_llm_with_tools` inner loop
- `services/heartbeat.py` tool loop

`agent_context.py` rule 3 augmented with a paragraph teaching the model to recognize the marker and use `read_file` for full content rather than fabricating from prior knowledge.

16 new unit tests (pass-through, spill, utf-8 round-trip, threshold boundary, JSON-shape preservation, write-failure fallback, realistic 50-result jina_search end-to-end).
Stack
PR 3/4 of context-budget hygiene series.
Test plan
- `pytest backend/tests/test_tool_result_truncation.py` — 16/16 pass
- Sandbox (`_tool_results/`) holds under A2A relationship paths

Note: 3 tool loops still exist (caller.py:432, caller.py:763, heartbeat.py:335). Drift prevention here is partial — the next PR consolidates them into a single `execute_tool_calls` helper.

🤖 Generated with Claude Code